Support `auto_doctring` in Processors by yonigozlan · Pull Request #42101 · huggingface/transformers

yonigozlan · 2025-11-07T17:31:21Z

What does this PR do?

Add support for Processors in @auto_docstring, and many other improvements to auto_docstring.py and check_docstrings.py, including more robust auto-fix with check_docstrings for missing, redundant, or unnecessary docstrings.

For processors, auto_docstring will pull custom args docstrings from custom "Kwargs" TypeDicts and add them to the .__doc__. For example, for processing_aria, we have:

class AriaImagesKwargs(ImagesKwargs, total=False):
    """
    split_image (`bool`, *optional*, defaults to `False`):
        Whether to split large images into multiple crops. When enabled, images exceeding the maximum size are
        divided into overlapping crops that are processed separately and then combined. This allows processing
        of very high-resolution images that exceed the model's input size limits.
    max_image_size (`int`, *optional*, defaults to `980`):
        Maximum image size (in pixels) for a single image crop. Images larger than this will be split into
        multiple crops when `split_image=True`, or resized if splitting is disabled. This parameter controls
        the maximum resolution of individual image patches processed by the model.
    min_image_size (`int`, *optional*):
        Minimum image size (in pixels) for a single image crop. Images smaller than this will be upscaled to
        meet the minimum requirement. If not specified, images are processed at their original size (subject
        to the maximum size constraint).
    """

    split_image: bool
    max_image_size: int
    min_image_size: int


class AriaProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: AriaImagesKwargs

    _defaults = {
        "text_kwargs": {
            "padding": False,
            "return_mm_token_type_ids": False,
        },
        "images_kwargs": {
            "max_image_size": 980,
            "split_image": False,
        },
        "return_tensors": TensorType.PYTORCH,
    }


@auto_docstring
class AriaProcessor(ProcessorMixin):
    ...

    @auto_docstring
    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]],
        images: Optional[ImageInput] = None,
        **kwargs: Unpack[AriaProcessorKwargs],
    ) -> BatchFeature:
        r"""
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_mask** -- Pixel mask to be fed to a model. Returned when `images` is not `None`.
        """
        ...

which results in the following docstring:

print(AriaProcessor.__call__.__doc__)

        Args:
            text (`Union[str, list, list]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list, list, list]`, *optional*):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            split_image (`bool`, *optional*, defaults to `False`):
                Whether to split large images into multiple crops. When enabled, images exceeding the maximum size are
                divided into overlapping crops that are processed separately and then combined. This allows processing
                of very high-resolution images that exceed the model's input size limits.
            max_image_size (`int`, *optional*, defaults to `980`):
                Maximum image size (in pixels) for a single image crop. Images larger than this will be split into
                multiple crops when `split_image=True`, or resized if splitting is disabled. This parameter controls
                the maximum resolution of individual image patches processed by the model.
            min_image_size (`int`, *optional*):
                Minimum image size (in pixels) for a single image crop. Images smaller than this will be upscaled to
                meet the minimum requirement. If not specified, images are processed at their original size (subject
                to the maximum size constraint).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_mask** -- Pixel mask to be fed to a model. Returned when `images` is not `None`.

…asses

…rom-processors

… (temporarily)

…rom-processors

…m/yonigozlan/transformers into remove-attributes-from-processors

…rom-processors

* Super * Super * Super * Super --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* detectron2 - part 1 * detectron2 - part 2 --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

…ingface#41978) fix autoawq[kernels] Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

…ng-in-processor

HuggingFaceDocBuilderDev · 2026-01-06T16:45:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-01-06T18:53:29Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, altclip, aria, aya_vision, bamba, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp

yonigozlan · 2026-01-06T20:23:43Z

Also Cc @stevhliu :)

Cyrilvallez

Trusting you on that, but I think it would be time to add some proper tests no? I see a very old test_auto_docstrings.py but that does not run any tests -> probably a very nice idea to start rewriting it!

Cyrilvallez · 2026-01-07T09:40:04Z

src/transformers/utils/auto_docstring.py

+    intro = f"""Constructs a {class_name} which wraps {components_text} into a single processor.
+
+[`{class_name}`] offers all the functionalities of {classes_text}. See the
+{classes_text_short} for more information.
+"""


nit: can we use textwrap.dedent here, so that the string respects the function indentation?

Yep it's done right after

Humm, I don't see it 😅 I meant doing something like

intro = textwrap.dedent( """ bla bla more bla """ ).strip()

so that the indentation stays inside the function

yonigozlan · 2026-01-07T17:40:09Z

Trusting you on that, but I think it would be time to add some proper tests no? I see a very old test_auto_docstrings.py but that does not run any tests -> probably a very nice idea to start rewriting it!

Yes clearly! I'll add tests in the next autodocstring PR ;)

…ng-in-processor

stevhliu

nice, thanks! added a few nits to the parameter definitions :)

stevhliu · 2026-01-07T18:19:03Z

src/transformers/utils/auto_docstring.py

+
+    chat_template = {
+        "description": """
+    A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.


Suggested change

A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.

A Jinja template to convert lists of messages in a chat into a tokenizable string.

stevhliu · 2026-01-07T18:22:00Z

src/transformers/utils/auto_docstring.py

+    The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+    (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+    `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).


Suggested change

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings

(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set

`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings

(pretokenized string). If you pass a pretokenized input, set `is_split_into_words=True` to avoid ambiguity with batched inputs.

stevhliu · 2026-01-07T18:24:40Z

src/transformers/utils/auto_docstring.py

+    """,
+    }
+
+    audio = {


just curious, whats the difference between audio and audios below it?

I think audios is deprecated but still present in some places

stevhliu · 2026-01-07T18:26:58Z

src/transformers/utils/auto_docstring.py

+    pad_to_multiple_of = {
+        "description": """
+    If set will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
+    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability


Suggested change

This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability

This is especially useful to enable using Tensor Cores on NVIDIA hardware with compute capability

stevhliu · 2026-01-07T18:29:24Z

src/transformers/utils/auto_docstring.py

+    add_special_tokens = {
+        "description": """
+    Whether or not to add special tokens when encoding the sequences. This will use the underlying
+    `PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are


Suggested change

`PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are

[`PretrainedTokenizerBase.build_inputs_with_special_tokens`] function, which defines which tokens are

stevhliu · 2026-01-07T18:30:23Z

src/transformers/utils/auto_docstring.py

+    list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
+    you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).


Suggested change

list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),

you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

list of strings (pretokenized string). If you pass pretokenized input, set is_split_into_words=True to avoid ambiguity with batched inputs.

stevhliu · 2026-01-07T18:30:44Z

src/transformers/utils/auto_docstring.py

+    list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
+    you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).


Suggested change

list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),

you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

list of strings (pretokenized string). If you pass pretokenized input, set is_split_into_words=True to avoid ambiguity with batched inputs.

…ng-in-processor

…om/yonigozlan/transformers into support-auto_doctring-in-processor

…ng-in-processor

* remove attributes and add all missing sub processors to their auto classes * remove all mentions of .attributes * cleanup * fix processor tests * fix modular * remove last attributes * fixup * fixes after merge * fix wrong tokenizer in auto florence2 * fix missing audio_processor + nits * Override __init__ in NewProcessor and change hf-internal-testing-repo (temporarily) * fix auto tokenizer test * add init to markup_lm * update CustomProcessor in custom_processing * remove print * nit * fix test modeling owlv2 * fix test_processing_layoutxlm * Fix owlv2, wav2vec2, markuplm, voxtral issues * add support for loading and saving multiple tokenizer natively * remove exclude_attributes from save_pretrained * Run slow v2 (huggingface#41914) * Super * Super * Super * Super --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix `detectron2` installation in docker files (huggingface#41975) * detectron2 - part 1 * detectron2 - part 2 --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix `autoawq[kernels]` installation in quantization docker file (huggingface#41978) fix autoawq[kernels] Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * add support for saving encoder only so any parakeet model can be loaded for inference (huggingface#41969) * add support for saving encoder only so any decoder model can be loaded Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * use convolution_bias * convert modular * convolution_bias in convertion script --------- Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> * Use indices as position_ids in modernebert (huggingface#41789) * Use indices as position_ids in modernebert * Move position_ids init to the branch * test tensor parallel: make tests for dense model more robust (huggingface#41968) * make test forward and backward more robust * refactor compile part of test tensor parallel * linting * pass rank around instead of calling it over and over * Run slow v2 (huggingface#41914) * Super * Super * Super * Super --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix `detectron2` installation in docker files (huggingface#41975) * detectron2 - part 1 * detectron2 - part 2 --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Fix `autoawq[kernels]` installation in quantization docker file (huggingface#41978) fix autoawq[kernels] Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * add support for saving encoder only so any parakeet model can be loaded for inference (huggingface#41969) * add support for saving encoder only so any decoder model can be loaded Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> * use convolution_bias * convert modular * convolution_bias in convertion script --------- Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> --------- Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> * fix: dict[RopeParameters] to dict[str, RopeParameters] (huggingface#41963) * docs: add continuous batching page (huggingface#41847) * docs: add continuous batching page * docs(cb): add `generate_batch` example * docs(cb): add `opentelemtry` and `serving` section * feat: add `TODO` note about opentelemetry dependency * docs(cb): add supported features * docs(cb): add unsupported features * docs(cb): add `ContinuousBatchingManager` example * docs(cb): x reference CB in optimizing inference * Fix `torchcodec` version in quantization docker file (huggingface#41988) check Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * [kernels] Add Tests & CI for kernels (huggingface#41765) * first commit * add tests * add kernel config * add more tests * add ci * small fix * change branch name * update tests * nit * change test name * revert jobs * addressing review * reenable all jobs * address second review * Move the Mi355 to regular docker (huggingface#41989) * Move the Mi355 to regular docker * Disable gfx950 compilation for FA on AMD * More data in benchmarking (huggingface#41848) * Reduce scope of cross-generate * Rm generate_sall configs * Workflow benchmarks more * Prevent crash when FA is not installed * fix (CI): Refactor SSH runners (huggingface#41991) * Change ssh runner type * Add wait step to SSH runner workflow * Rename wait step to wait2 in ssh-runner.yml * Remove wait step from ssh-runner.yml Removed the wait step from the SSH runner workflow. * Update runner type for single GPU A10 instance * Update SSH runner version to 1.90.3 * Add sha256sum to ssh-runner workflow * Update runner type and remove unused steps * fix 3 failed test cases for video_llama_3 model on Intel XPU (huggingface#41931) * fix 3 failed test cases for video_llama_3 model on Intel XPU Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * update Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * adjust format Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * update code Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> --------- Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> * Integrate colqwen2.5 using colqwen2 modelling code (huggingface#40600) * adding option for 2.5 * minor - arg in conversion script * getting started on modelling.py * minor - shouldve been using modular * adressing comments + fixing datatype/device _get method * minor * commiting suggestion Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com> * docs + first test * ruff fix * minor fix * ruff fix * model fix * model fix * fine-grained check, with a hardcoded score from the original Hf implementation. * minor ruff * update tests values with CI hardware * adding 2.5 to conversion script * Apply style fixes --------- Co-authored-by: Sahil Kabir <sahilkabir@Sahils-MacBook-Pro.local> Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com> Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Fixed wrong padding value in OWLv2 (huggingface#41938) * Update image_processing_owlv2_fast.py fixed padding value * fixed padding value * Change padding constant value from 0.5 to 0.0 * Fixed missed padding value in modular_owlv2.py --------- Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com> * Fix `run slow v2`: empty report when there is only one model (huggingface#42002) fix Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * [kernels] change import time in KernelConfig (huggingface#42004) * change import time * style * DOC Fix typo in argument name: pseudoquant (huggingface#41994) The correct argument name is pseudoquantization. Since there is no error on passing wrong arguments name (which is arguably an anti-pattern), this is difficult for users to debug. * Fix `torch+deepspeed` docker file (huggingface#41985) * fix * delete --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> * Correct syntax error in trainer.md (huggingface#42001) A comma is missing between two parameters in the signature of compute_loss function. * Reduce the number of benchmark in the CI (huggingface#42008) Changed how benchmark cfgs are chosen * Fix continuous batching tests (huggingface#42012) * Fix continuous batching tests * make fixup * add back `logging_dir` (huggingface#42013) * add back * Apply style fixes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Fix issue with from pretrained and kwargs in image processors (huggingface#41997) * accept kwargs in image proc from_pretrained * only use kwargs that are in cls.valid_kwargs * remove specific logic for _from_auto * add image_seq_length to Images_kwargs for backward compatibility * fix missing image kwargs in pix2struct * Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors (huggingface#41871) * Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors * Fix default initialization of image_rows and image_cols in Idefics3 and SmolVLM processors * Add GLPNImageProcessorFast (huggingface#41725) * Add GLPNImageProcessorFast for torch backend * Address review feedback - Simplified to_dict() method - Keep tensors as torch instead of converting to numpy for heterogeneous shapes - Removed unnecessary shape guards in post_process_depth_estimation - Improved variable names (tgt -> target_size, d -> resized) - Removed unnecessary GLPNImageProcessorKwargs class * Address review feedback - Simplified to_dict() method - Keep tensors as torch instead of converting to numpy for heterogeneous shapes - Removed unnecessary shape guards in post_process_depth_estimation - Improved variable names (tgt -> target_size, d -> resized) - Removed unnecessary GLPNImageProcessorKwargs class * commits after 2nd review * Address all review feedback and add explicit batched test - Simplified to_dict() with descriptive variable names (d->output_dict) - Fixed resize operation: changed from crop to proper resize with interpolation - Added padding for heterogeneous batch shapes in both slow and fast processors - Fused rescale and normalize operations for efficiency - Improved all variable names (tgt->target_size, d->depth_4d->resized) - Added GLPNImageProcessorKwargs class in slow processor and imported in fast - Renamed test_equivalence_slow_fast to test_slow_fast_equivalence - Added explicit test_slow_fast_equivalence_batched test - All 20 tests passing * using padding from utils * simplify glpn image processor fast * fix docstring --------- Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co> Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com> * add fuyu fast image processors (huggingface#41817) * added fast processor for fuyu (huggingface#36978) * updated docs for fuyu model (huggingface#36978) * updated test_image_processing and image_processing_fuyu_fast * updated fuyu.md and image_processing_fuyu_fast (huggingface#36978) * updated test_image_processing_fuyu (huggingface#36978) * formatted image_processing_fuyu_fast and test_image_processing_fuyu (huggingface#36978) * updated tests and fuyu fast image processing (huggingface#36978) * Merge branch 'fuyu-fast-image-processors' of https://github.com/DeXtAr47-oss/transformers into fuyu-fast-image-processors * fixed format (huggingface#36978) * formatted files (huggingface#36978) * formatted files * revert unnecessary changes * clean up and process by group --------- Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co> * [kernels] Fix XPU layernorm kernel (huggingface#41583) * fix * add comment * better fix * style * Update src/transformers/modeling_utils.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * [v5] Deprecate Text2Text and related pipelines (huggingface#41996) * Deprecate Text2Text and related pipelines * Try a restructure * make fixup * logging -> logger * [FPQuant] MXFP8 and MXFP4 backwards support (huggingface#41897) * FP-Quant backwards * fp-quant v0.3.0 docker * availability version bump * fp_quant==0.3.1 * fp_quant v0.3.2 * add working auto_docstring for processors * add auto_docstring to processors first part * add auto_docstring to processors part 2 * modifs after review * fully working auto_docstring and check_docstring with placeholder docstrings * Working check_docstrings for Typed dicts * Add recurring processor args to auto_docstring and add support for removing redundant docstring and placeholders * replace placeholders with real docstrings * fix copies * fixup * remove unwanted changes * fix unprotected imports * Fix unprotected imports * fix unprotected imports * Add __call__ to all docs of processors * nits docs --------- Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com> Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com> Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com> Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com> Co-authored-by: Rémi Ouazan <83456801+remi-or@users.noreply.github.com> Co-authored-by: Ferdinand Mom <47445085+3outeille@users.noreply.github.com> Co-authored-by: Ryan Mullins <ryanmullins@google.com> Co-authored-by: Luc Georges <McPatate@users.noreply.github.com> Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com> Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com> Co-authored-by: kaixuanliu <kaixuan.liu@intel.com> Co-authored-by: Sahil Kabir <66221472+sahil-kabir@users.noreply.github.com> Co-authored-by: Sahil Kabir <sahilkabir@Sahils-MacBook-Pro.local> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: James <67161633+gjamesgoenawan@users.noreply.github.com> Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> Co-authored-by: Yacklin Wong <139425274+Yacklin@users.noreply.github.com> Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: MilkClouds <claude@maum.ai> Co-authored-by: ARAVINDHAN T <arvindhant01@gmail.com> Co-authored-by: Pritam Das <79273068+DeXtAr47-oss@users.noreply.github.com> Co-authored-by: Andrei Panferov <andrei@panferov.org>

yonigozlan and others added 30 commits October 15, 2025 15:47

remove attributes and add all missing sub processors to their auto cl…

f48a47b

…asses

remove all mentions of .attributes

d5d5c58

cleanup

dd505b5

fix processor tests

6a1448f

fix modular

a292900

remove last attributes

63a255d

fixup

ef73759

Merge remote-tracking branch 'upstream/main' into remove-attributes-f…

b5e8b2e

…rom-processors

fixes after merge

f14ff3c

fix wrong tokenizer in auto florence2

0306430

fix missing audio_processor + nits

01cb815

Override __init__ in NewProcessor and change hf-internal-testing-repo…

49ec906

… (temporarily)

Merge remote-tracking branch 'upstream/main' into remove-attributes-f…

7dd5682

…rom-processors

fix auto tokenizer test

946cc5c

add init to markup_lm

b0cb3e0

update CustomProcessor in custom_processing

3b9e846

remove print

53de7a4

Merge branch 'main' into remove-attributes-from-processors

93d2c4d

Merge remote-tracking branch 'upstream/main' into remove-attributes-f…

feeec28

…rom-processors

nit

4a6b080

Merge branch 'remove-attributes-from-processors' of https://github.co…

02402a0

…m/yonigozlan/transformers into remove-attributes-from-processors

fix test modeling owlv2

757e1f1

fix test_processing_layoutxlm

bf763b2

Fix owlv2, wav2vec2, markuplm, voxtral issues

0799a0a

Merge remote-tracking branch 'upstream/main' into remove-attributes-f…

bf1a4b6

…rom-processors

add support for loading and saving multiple tokenizer natively

e3f130d

remove exclude_attributes from save_pretrained

cc45a7e

Run slow v2 (huggingface#41914)

6b9e7c9

* Super * Super * Super * Super --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Fix detectron2 installation in docker files (huggingface#41975)

0ccb0e3

* detectron2 - part 1 * detectron2 - part 2 --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Fix autoawq[kernels] installation in quantization docker file (hugg…

1eeece5

…ingface#41978) fix autoawq[kernels] Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

yonigozlan added 2 commits January 6, 2026 16:34

replace placeholders with real docstrings

1ce14e9

Merge remote-tracking branch 'upstream/main' into support-auto_doctri…

0ee2c3f

…ng-in-processor

yonigozlan added 3 commits January 6, 2026 17:31

fix copies

22b29b8

fixup

8d5ffa8

remove unwanted changes

ab1f03b

yonigozlan changed the title ~~[WIP] Support auto_doctring in Processors~~ Support auto_doctring in Processors Jan 6, 2026

yonigozlan added 3 commits January 6, 2026 18:19

fix unprotected imports

525804c

Fix unprotected imports

852b458

fix unprotected imports

03d1cd3

Add __call__ to all docs of processors

22721cd

yonigozlan requested review from ArthurZucker and Cyrilvallez January 6, 2026 20:10

Cyrilvallez approved these changes Jan 7, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/main' into support-auto_doctri…

b170599

…ng-in-processor

stevhliu approved these changes Jan 7, 2026

View reviewed changes

yonigozlan and others added 2 commits January 7, 2026 20:41

nits docs

b73220d

Merge branch 'main' into support-auto_doctring-in-processor

dcea25a

yonigozlan enabled auto-merge (squash) January 7, 2026 20:41

yonigozlan and others added 5 commits January 7, 2026 19:49

Merge branch 'main' into support-auto_doctring-in-processor

80b849f

Merge remote-tracking branch 'upstream/main' into support-auto_doctri…

edae136

…ng-in-processor

Merge branch 'support-auto_doctring-in-processor' of https://github.c…

14a5070

…om/yonigozlan/transformers into support-auto_doctring-in-processor

add flaky test

b3bf0e3

Merge remote-tracking branch 'upstream/main' into support-auto_doctri…

d639cd9

…ng-in-processor

yonigozlan merged commit c8bc4de into huggingface:main Jan 8, 2026
25 checks passed

vasqu mentioned this pull request Jan 8, 2026

[Timm] Increase tol in flaky test #43173

Closed

	A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
	A Jinja template to convert lists of messages in a chat into a tokenizable string.

	This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
	This is especially useful to enable using Tensor Cores on NVIDIA hardware with compute capability

	`PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
	[`PretrainedTokenizerBase.build_inputs_with_special_tokens`] function, which defines which tokens are

		list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
		you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

Conversation

yonigozlan commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Jan 6, 2026

Uh oh!

github-actions bot commented Jan 6, 2026

Uh oh!

yonigozlan commented Jan 6, 2026

Uh oh!

Cyrilvallez left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Cyrilvallez Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yonigozlan commented Jan 7, 2026

Uh oh!

stevhliu left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

yonigozlan commented Nov 7, 2025 •

edited

Loading

Cyrilvallez Jan 8, 2026 •

edited

Loading